Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PCRE2: optimize memory allocations #15395

Draft
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

ysbaddaden
Copy link
Contributor

@ysbaddaden ysbaddaden commented Jan 31, 2025

We noticed in #15088 that we don't need Crystal::ThreadLocalValue in the Regex PCRE2 engine.

We can reuse the JIT stack but also the match data for every Regex (no need for a specific match data per Regex). We can allocate one and make sure it's not used by other threads... hence the thread locals: no more spinlock (thread contention) nor hash.

It's simpler and faster. Here's the benchmark from #13144 for example:

$ crystal run --release bench/regex.cr
starts_with?  29.46M ( 33.94ns) (± 0.27%)
    matches?  40.34M ( 24.79ns) (± 0.74%)
$ bin/crystal run --release bench/regex.cr
starts_with?  30.70M ( 32.58ns) (± 0.26%)
    matches?  46.28M ( 21.61ns) (± 0.72%)

Enabling MT also no longer has any impact on performance:

$ crystal run --release -Dpreview_mt bench/regex.cr
starts_with?  26.50M ( 37.73ns) (± 0.09%)
    matches?  41.75M ( 23.95ns) (± 0.56%)
$ bin/crystal run --release -Dpreview_mt bench/regex.cr
starts_with?  30.61M ( 32.67ns) (± 0.36%)
    matches?  48.45M ( 20.64ns) (± 1.51%)

The drawback is that we must allocate each matchdata with a maximum number of ovectors (65535). That might increase memory usage, though I failed to notice it in practice. Maybe not allocating memory for every regular expression is helping?

Note: this PR will be separated into a couple PRs to introduce Thread::Local(T) separately. The point of this new type is for this patch, so I want approval on the overall approach before the split.

I could have used @[ThreadLocal] but some targets don't support it (namely: Android, MinGW and OpenBSD) and we can't register destructors either (on thread shutdown). But using pthread_key_create or FlsAlloc we can 😍

Note 2: using a type and a system destructor allows to avoid adding a property to Thread that would then need a finalizer that would hardlink and force require PCRE2 into all Crystal applications again. Using system destructors also means that the destructor will always run, even if the thread is cancelled or whatever.

Note 3: I'm wondering if this Thread::Local(T) type ain't deprecating the @[ThreadLocal] annotation: it works everywhere and the annotation doesn't even spare a function call —the codegen marks it as noinline to avoid cache issues.

@ysbaddaden ysbaddaden self-assigned this Jan 31, 2025
@ysbaddaden ysbaddaden changed the title Refactor: PCRE2 memory allocations PCRE2: optimize memory allocations Jan 31, 2025
@ysbaddaden
Copy link
Contributor Author

Damn, that was promising, and it's working well on Linux, but both Darwin and Windows completely broke 😭

Windows could be explained by using FlsAlloc instead of TlsAlloc, but Darwin?

@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Feb 4, 2025

WTH: I can't reproduce the errors in my Windows VM (MSVC) 😕

@ysbaddaden
Copy link
Contributor Author

I can reproduce when running the compiler specs. Looking at the memory graph, the committed memory keeps growing rapidly, and it starts failing exactly when reaching 32GB (the VM has 16GB of RAM). That can't be a coincidence.

When running the compiler specs from master, the committed memory is rather stable or grows slowly (around 3.5GB). Let's note that a linker command still fails, but I get a bunch of weird spec errors in the VM.

Maybe FlsAlloc ain't exactly compatible with our fiber-context switch (argh)?

@ysbaddaden ysbaddaden force-pushed the refactor/pcre2-memory-allocations branch from 7ceca47 to 7487cb5 Compare March 11, 2025 07:43
@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Mar 11, 2025

Here's a new attempt.

From CI runs in my fork, the darwin and win32-msvc targets got fixed, but win32-gnu is still broken on dual heredocs such as call <<-CRYSTAL, <<-CRYSTAL which I don't understand 😞

Note: I believe FLS should work on Windows, which would avoid the custom destructors. The stdlib specs pass, save for these failures in the compiler specs.

@HertzDevil
Copy link
Contributor

Yeah that's the LLVM 20 failure, ignore it for the moment

@ysbaddaden
Copy link
Contributor Author

I tried again, and FLS is apparently working on Windows. Nice.

Thread.threads.delete(self)
Fiber.inactive(fiber)
detach { system_close }
end
end

protected def run_destructors : Nil
end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: cleanup before merge (not needed anymore).

@ysbaddaden ysbaddaden marked this pull request as ready for review March 11, 2025 13:36
Comment on lines +3 to +9
# def initialize
# {% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %}
# end

# def initialize(&destructor : Proc(T, Nil))
# {% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %}
# end
Copy link
Contributor

@Sija Sija Mar 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps these could be uncommented and then called via super in descendant classes to avoid duplicating the code?

Suggested change
# def initialize
# {% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %}
# end
# def initialize(&destructor : Proc(T, Nil))
# {% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %}
# end
def initialize
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %}
end
def initialize(&destructor : Proc(T, Nil))
initialize
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It avoids some repetition, but I'm not sure it would improve readability?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we could use Box for this? Seems to be exactly the use case. Technically we don't need support for boxing structs onto the heap though. So this draws in additional complexity which we might want to avoid here.

Copy link
Contributor Author

@ysbaddaden ysbaddaden Mar 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, you lost me @straight-shoota 😕

Do you mean to Box the destructor proc? Oh, to Box the value in case it's not a Pointer/Reference! 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to not support non pointer values. If we ever need to, we'll figure it out then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea for using Box would be that it might simplify the code for Thread::Local. All explicit casts would become Box(T).box and Box(T).unbox and no need for checking T type in the macro.
I suppose it doesn't simplify that much though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But Box always allocates into the GC heap, and we'd put the pointer to the box in the local storage that we'd have to dereference on every get 😕

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errata: Box doesn't allocate into the GC heap when T is Nil or Reference but we want to store pointers and that will allocate.

Maybe it's an overlook? 🤔

Comment on lines +7 to +17
def initialize
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %}
err = LibC.pthread_key_create(out @key, nil)
raise RuntimeError.from_os_error("pthread_key_create", Errno.new(err)) unless err == 0
end

def initialize(&destructor : Proc(T, Nil))
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %}
err = LibC.pthread_key_create(out @key, destructor.unsafe_as(Proc(Void*, Nil)))
raise RuntimeError.from_os_error("pthread_key_create", Errno.new(err)) unless err == 0
end
Copy link
Member

@straight-shoota straight-shoota Mar 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Extract the common code into a helper to remove duplication.

Suggested change
def initialize
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %}
err = LibC.pthread_key_create(out @key, nil)
raise RuntimeError.from_os_error("pthread_key_create", Errno.new(err)) unless err == 0
end
def initialize(&destructor : Proc(T, Nil))
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %}
err = LibC.pthread_key_create(out @key, destructor.unsafe_as(Proc(Void*, Nil)))
raise RuntimeError.from_os_error("pthread_key_create", Errno.new(err)) unless err == 0
end
def initialize
create_key
end
def initialize(&destructor : Proc(T, Nil))
create_key destructor
end
private def create_key(destructor = nil)
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %}
err = LibC.pthread_key_create(out @key, destructor.unsafe_as(Proc(Void*, Nil)))
raise RuntimeError.from_os_error("pthread_key_create", Errno.new(err)) unless err == 0
end

ditto for win32

Comment on lines +7 to +17
def initialize
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %}
@key = LibC.FlsAlloc
raise RuntimeError.from_winerror("FlsAlloc: out of indexes") if @key == LibC::FLS_OUT_OF_INDEXES
end

def initialize(&destructor : Proc(T, Nil))
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %}
@key = LibC.FlsAlloc(destructor.unsafe_as(LibC::FLS_CALLBACK_FUNCTION))
raise RuntimeError.from_winerror("FlsAlloc: out of indexes") if @key == LibC::FLS_OUT_OF_INDEXES
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def initialize
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %}
@key = LibC.FlsAlloc
raise RuntimeError.from_winerror("FlsAlloc: out of indexes") if @key == LibC::FLS_OUT_OF_INDEXES
end
def initialize(&destructor : Proc(T, Nil))
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %}
@key = LibC.FlsAlloc(destructor.unsafe_as(LibC::FLS_CALLBACK_FUNCTION))
raise RuntimeError.from_winerror("FlsAlloc: out of indexes") if @key == LibC::FLS_OUT_OF_INDEXES
end
def initialize
create_key nil
end
def initialize(&destructor : Proc(T, Nil))
create_key destructor
end
private def create_key(destructor)
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %}
@key = LibC.FlsAlloc(destructor.unsafe_as(LibC::FLS_CALLBACK_FUNCTION))
raise RuntimeError.from_winerror("FlsAlloc: out of indexes") if @key == LibC::FLS_OUT_OF_INDEXES
end

Comment on lines +3 to +9
# def initialize
# {% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %}
# end

# def initialize(&destructor : Proc(T, Nil))
# {% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %}
# end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea for using Box would be that it might simplify the code for Thread::Local. All explicit casts would become Box(T).box and Box(T).unbox and no need for checking T type in the macro.
I suppose it doesn't simplify that much though.

@ysbaddaden
Copy link
Contributor Author

Moving back to draft. I'll be pushing individual PRs with the changes proposed here:

A branch with all the changes is available at https://github.com/ysbaddaden/crystal/tree/feature/add-thread-local-struct

@ysbaddaden ysbaddaden marked this pull request as draft March 17, 2025 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Review
Development

Successfully merging this pull request may close these issues.

5 participants